import kagglehub
# Download latest version
path = kagglehub.dataset_download("iamsouravbanerjee/house-rent-prediction-dataset")
print("Path to dataset files:", path)
Regression Main Concepts
Photo link, author: Robert Levonyan
TOC
- Basic EDA
- Encoding categorical features
  - 2.1 OHE
  - 2.2 Label Encoding
  - 2.3 Ordinal Encoding
  - 2.4 Target Encoding
- Linear Regression with Scikit-learn
- Feature Scaling
  - 4.1 MinMaxScaler
  - 4.2 StandardScaler

Next lesson:
1. Train-test split
2. Underfitting and Overfitting
3. Cross-validation
4. Regularization
Kaggle data
https://www.kaggle.com/datasets/iamsouravbanerjee/house-rent-prediction-dataset/data
BHK: Number of Bedrooms, Hall, Kitchen.
Rent: Rent of the Houses/Apartments/Flats.
Size: Size of the Houses/Apartments/Flats in Square Feet.
Floor: Floor the unit is on, plus the total number of floors (e.g., Ground out of 2, 3 out of 5).
Area Type: Whether the size is measured as Super Area, Carpet Area, or Built Area.
Area Locality: Locality of the Houses/Apartments/Flats.
City: City where the Houses/Apartments/Flats are located.
Furnishing Status: Furnishing status, one of Furnished, Semi-Furnished, or Unfurnished.
Tenant Preferred: Type of Tenant Preferred by the Owner or Agent.
Bathroom: Number of Bathrooms.
Point of Contact: Whom should you contact for more information regarding the Houses/Apartments/Flats.
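The Floor column mixes text and numbers ("Ground out of 2", "3 out of 5"). The notebook drops it later, but as a hedged sketch it could be parsed into two numeric features; the numeric codes for Ground and the basements are assumptions, not part of the original lesson:

def parse_floor(value):
    # "3 out of 5" -> (3, 5); "Ground out of 2" -> (0, 2); "Ground" -> (0, None)
    level, _, total = str(value).partition(" out of ")
    special = {"Ground": 0, "Upper Basement": -1, "Lower Basement": -2}
    level_num = special.get(level)
    if level_num is None:
        level_num = int(level)
    return level_num, int(total) if total else None

# df[["floor_level", "total_floors"]] = df["Floor"].apply(lambda v: pd.Series(parse_floor(v)))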
import os
import pandas as pd
# os.listdir(path)
os.listdir()
['C1_Hyperparameter_Optimization.ipynb',
'C1_Regression_Main_Concepts.ipynb',
'Dataset Glossary.txt',
'data_lin_reg.csv',
'House_Rent_Dataset.csv',
'report.html']
# df = pd.read_csv(os.path.join(path, "House_Rent_Dataset.csv"))  # os.path.join handles / vs \ separators
df = pd.read_csv("House_Rent_Dataset.csv")
df
Posted On | BHK | Rent | Size | Floor | Area Type | Area Locality | City | Furnishing Status | Tenant Preferred | Bathroom | Point of Contact | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-05-18 | 2 | 10000 | 1100 | Ground out of 2 | Super Area | Bandel | Kolkata | Unfurnished | Bachelors/Family | 2 | Contact Owner |
1 | 2022-05-13 | 2 | 20000 | 800 | 1 out of 3 | Super Area | Phool Bagan, Kankurgachi | Kolkata | Semi-Furnished | Bachelors/Family | 1 | Contact Owner |
2 | 2022-05-16 | 2 | 17000 | 1000 | 1 out of 3 | Super Area | Salt Lake City Sector 2 | Kolkata | Semi-Furnished | Bachelors/Family | 1 | Contact Owner |
3 | 2022-07-04 | 2 | 10000 | 800 | 1 out of 2 | Super Area | Dumdum Park | Kolkata | Unfurnished | Bachelors/Family | 1 | Contact Owner |
4 | 2022-05-09 | 2 | 7500 | 850 | 1 out of 2 | Carpet Area | South Dum Dum | Kolkata | Unfurnished | Bachelors | 1 | Contact Owner |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4741 | 2022-05-18 | 2 | 15000 | 1000 | 3 out of 5 | Carpet Area | Bandam Kommu | Hyderabad | Semi-Furnished | Bachelors/Family | 2 | Contact Owner |
4742 | 2022-05-15 | 3 | 29000 | 2000 | 1 out of 4 | Super Area | Manikonda, Hyderabad | Hyderabad | Semi-Furnished | Bachelors/Family | 3 | Contact Owner |
4743 | 2022-07-10 | 3 | 35000 | 1750 | 3 out of 5 | Carpet Area | Himayath Nagar, NH 7 | Hyderabad | Semi-Furnished | Bachelors/Family | 3 | Contact Agent |
4744 | 2022-07-06 | 3 | 45000 | 1500 | 23 out of 34 | Carpet Area | Gachibowli | Hyderabad | Semi-Furnished | Family | 2 | Contact Agent |
4745 | 2022-05-04 | 2 | 15000 | 1000 | 4 out of 5 | Carpet Area | Suchitra Circle | Hyderabad | Unfurnished | Bachelors | 2 | Contact Owner |
4746 rows × 12 columns
Basic EDA (very important)
exploratory data analysis (EDA)
df
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4746 entries, 0 to 4745
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Posted On 4746 non-null object
1 BHK 4746 non-null int64
2 Rent 4746 non-null int64
3 Size 4746 non-null int64
4 Floor 4746 non-null object
5 Area Type 4746 non-null object
6 Area Locality 4746 non-null object
7 City 4746 non-null object
8 Furnishing Status 4746 non-null object
9 Tenant Preferred 4746 non-null object
10 Bathroom 4746 non-null int64
11 Point of Contact 4746 non-null object
dtypes: int64(4), object(8)
memory usage: 445.1+ KB
assert df.isna().sum().sum() == 0, "Problem"
df.describe()
BHK | Rent | Size | Bathroom | |
---|---|---|---|---|
count | 4746.000000 | 4.746000e+03 | 4746.000000 | 4746.000000 |
mean | 2.083860 | 3.499345e+04 | 967.490729 | 1.965866 |
std | 0.832256 | 7.810641e+04 | 634.202328 | 0.884532 |
min | 1.000000 | 1.200000e+03 | 10.000000 | 1.000000 |
25% | 2.000000 | 1.000000e+04 | 550.000000 | 1.000000 |
50% | 2.000000 | 1.600000e+04 | 850.000000 | 2.000000 |
75% | 3.000000 | 3.300000e+04 | 1200.000000 | 2.000000 |
max | 6.000000 | 3.500000e+06 | 8000.000000 | 10.000000 |
df_categ = df.select_dtypes("object")
df_categ
Posted On | Floor | Area Type | Area Locality | City | Furnishing Status | Tenant Preferred | Point of Contact | |
---|---|---|---|---|---|---|---|---|
0 | 2022-05-18 | Ground out of 2 | Super Area | Bandel | Kolkata | Unfurnished | Bachelors/Family | Contact Owner |
1 | 2022-05-13 | 1 out of 3 | Super Area | Phool Bagan, Kankurgachi | Kolkata | Semi-Furnished | Bachelors/Family | Contact Owner |
2 | 2022-05-16 | 1 out of 3 | Super Area | Salt Lake City Sector 2 | Kolkata | Semi-Furnished | Bachelors/Family | Contact Owner |
3 | 2022-07-04 | 1 out of 2 | Super Area | Dumdum Park | Kolkata | Unfurnished | Bachelors/Family | Contact Owner |
4 | 2022-05-09 | 1 out of 2 | Carpet Area | South Dum Dum | Kolkata | Unfurnished | Bachelors | Contact Owner |
... | ... | ... | ... | ... | ... | ... | ... | ... |
4741 | 2022-05-18 | 3 out of 5 | Carpet Area | Bandam Kommu | Hyderabad | Semi-Furnished | Bachelors/Family | Contact Owner |
4742 | 2022-05-15 | 1 out of 4 | Super Area | Manikonda, Hyderabad | Hyderabad | Semi-Furnished | Bachelors/Family | Contact Owner |
4743 | 2022-07-10 | 3 out of 5 | Carpet Area | Himayath Nagar, NH 7 | Hyderabad | Semi-Furnished | Bachelors/Family | Contact Agent |
4744 | 2022-07-06 | 23 out of 34 | Carpet Area | Gachibowli | Hyderabad | Semi-Furnished | Family | Contact Agent |
4745 | 2022-05-04 | 4 out of 5 | Carpet Area | Suchitra Circle | Hyderabad | Unfurnished | Bachelors | Contact Owner |
4746 rows × 8 columns
df_categ.nunique()
Posted On 81
Floor 480
Area Type 3
Area Locality 2235
City 6
Furnishing Status 3
Tenant Preferred 3
Point of Contact 3
dtype: int64
df_categ.Floor.unique()
array(['Ground out of 2', '1 out of 3', '1 out of 2', 'Ground out of 1',
'Ground out of 4', '1 out of 4', '1 out of 1', 'Ground out of 3',
'2 out of 3', '4 out of 5', '2 out of 2', '2 out of 5',
'4 out of 14', '3 out of 3', '5 out of 5', '4 out of 4',
'7 out of 8', '2 out of 4', '3 out of 4', '1 out of 5',
'8 out of 5', 'Ground out of 6', '2 out of 1',
'Upper Basement out of 4', 'Ground out of 5', '3 out of 5',
'11 out of 19', '5 out of 10', '11 out of 14',
'Lower Basement out of 2', '2 out of 7', '4 out of 10',
'7 out of 10', '2 out of 13', '6 out of 7', '4 out of 7',
'14 out of 14', '43 out of 78', '2 out of 8', '13 out of 18',
'5 out of 12', '18 out of 24', '3 out of 7', '17 out of 31',
'11 out of 21', '7 out of 19', '14 out of 23', '9 out of 20',
'Upper Basement out of 9', '19 out of 24', '3 out of 21',
'1 out of 22', '8 out of 8', '6 out of 12', '4 out of 58',
'Upper Basement out of 16', '60 out of 66', '34 out of 48',
'5 out of 8', '5 out of 14', '14 out of 40', '5 out of 7',
'9 out of 22', '12 out of 18', '26 out of 44', '1 out of 8',
'25 out of 42', '25 out of 41', '53 out of 78', 'Ground out of 7',
'14 out of 20', '13 out of 20', '16 out of 23', '10 out of 18',
'39 out of 60', '16 out of 21', '10 out of 32', '4 out of 8',
'12 out of 24', '32 out of 41', '3 out of 30', '13 out of 21',
'9 out of 29', '47 out of 89', '7 out of 41', '28 out of 30',
'13 out of 15', '6 out of 21', '8 out of 16', '2 out of 6',
'5 out of 19', '3 out of 11', '17 out of 42', '10 out of 12',
'8 out of 28', '9 out of 15', '14 out of 22', '18 out of 40',
'9 out of 17', '12 out of 45', '25 out of 35', '7 out of 15',
'10 out of 16', 'Upper Basement out of 20', '5 out of 20',
'Upper Basement out of 40', '5 out of 18', '34 out of 58',
'4 out of 6', '20 out of 22', '12 out of 19', '15 out of 18',
'65 out of 78', '6 out of 16', '17 out of 22', '6 out of 24',
'40 out of 75', '19 out of 38', '15 out of 31', '11 out of 28',
'10 out of 22', '17 out of 24', '15 out of 19', '9 out of 10',
'7 out of 12', '8 out of 20', '11 out of 13', '9 out of 19',
'37 out of 51', '6 out of 11', '8 out of 15', '11 out of 20',
'10 out of 23', 'Upper Basement out of 10', '7 out of 23',
'4 out of 11', '17 out of 43', '7 out of 22', '14 out of 18',
'6 out of 10', '8 out of 12', '3 out of 18', '7 out of 7',
'14 out of 58', '18 out of 23', '19 out of 19', '13 out of 14',
'7 out of 11', '11 out of 22', 'Upper Basement out of 30',
'12 out of 14', '16 out of 31', '12 out of 13', '11 out of 51',
'2 out of 12', '22 out of 24', '7 out of 14', '5 out of 13',
'7 out of 21', '14 out of 21', '17 out of 25', '9 out of 14',
'8 out of 27', '3 out of 6', '17 out of 20', '18 out of 22',
'1 out of 7', '9 out of 30', '3 out of 8', '11 out of 26',
'17 out of 27', '4 out of 12', '12 out of 16', '10 out of 24',
'65 out of 76', '7 out of 9', '17 out of 60', '10 out of 11',
'18 out of 25', '5 out of 11', '15 out of 17', '15 out of 23',
'5 out of 17', '3 out of 28', '5 out of 24', '16 out of 32',
'21 out of 22', '7 out of 13', '9 out of 12', '15 out of 32',
'18 out of 27', '15 out of 16', '18 out of 45', '15 out of 15',
'6 out of 14', '1 out of 20', '16 out of 36', '30 out of 44',
'30 out of 37', '2 out of 9', '12 out of 22', '4 out of 9',
'2 out of 22', '5 out of 6', '6 out of 18', '35 out of 55',
'16 out of 29', '30 out of 45', '5 out of 9', '16 out of 25',
'33 out of 42', '4 out of 16', '13 out of 23', '9 out of 38',
'6 out of 8', '8 out of 13', '19 out of 30', '10 out of 14',
'11 out of 24', '9 out of 16', '9 out of 31', '4 out of 15',
'3 out of 9', '22 out of 30', '3 out of 58', '1 out of 9',
'53 out of 60', '5 out of 22', '15 out of 22', '19 out of 21',
'9 out of 40', 'Ground out of 8', '44 out of 75', '8 out of 17',
'3 out of 14', '12 out of 31', '26 out of 42', '2 out of 45',
'12 out of 68', '17 out of 36', '10 out of 28', '41 out of 41',
'14 out of 68', '14 out of 17', '15 out of 20', '46 out of 76',
'12 out of 20', '20 out of 30', '18 out of 32', '10 out of 25',
'17 out of 29', '10 out of 31', '10 out of 15', '13 out of 16',
'8 out of 10', '18 out of 21', '27 out of 58', '1 out of 6',
'19 out of 25', '3 out of 15', '25 out of 43', '8 out of 14',
'11 out of 12', '9 out of 21', '10 out of 13', '45 out of 77',
'18 out of 19', '10 out of 20', '12 out of 29',
'Lower Basement out of 18', '15 out of 24', '48 out of 68',
'12 out of 42', '16 out of 22', '35 out of 68', '18 out of 30',
'11 out of 31', '50 out of 75', '18 out of 26', '12 out of 27',
'16 out of 20', '24 out of 55', '16 out of 37',
'Upper Basement out of 7', '6 out of 15', '11 out of 27',
'11 out of 23', '3 out of 12', '14 out of 15', '23 out of 25',
'14 out of 48', '29 out of 35', '15 out of 36', '15 out of 25',
'15 out of 28', '3 out of 36', '8 out of 11', '6 out of 20',
'23 out of 23', '5 out of 15', '16 out of 18', '2 out of 10',
'40 out of 50', '25 out of 28', '12 out of 17', '34 out of 40',
'Upper Basement out of 22', '8 out of 23', '5 out of 21',
'32 out of 59', '20 out of 32', '9 out of 18', '10 out of 37',
'25 out of 48', '4 out of 22', '8 out of 18', '11 out of 11',
'5 out of 23', '60 out of 77', '11 out of 18', '4 out of 20',
'5 out of 16', '3 out of 13', '30 out of 58', '15 out of 43',
'7 out of 16', '18 out of 28', '9 out of 55', '11 out of 25',
'49 out of 55', '7 out of 27', '14 out of 27', '16 out of 27',
'25 out of 50', '6 out of 30', '21 out of 23', '8 out of 58',
'20 out of 41', '3 out of 62', '4 out of 13', '7 out of 17',
'12 out of 21', '28 out of 39', '15 out of 58', '6 out of 23',
'36 out of 45', '9 out of 28', '6 out of 45', '22 out of 52',
'10 out of 19', '21 out of 58', '48 out of 54', '7 out of 28',
'11 out of 15', '19 out of 22', '15 out of 37', '2 out of 17',
'76 out of 78', '3 out of 10', '20 out of 27', '8 out of 36',
'14 out of 33', '21 out of 21', '12 out of 25', '18 out of 29',
'14 out of 35', '7 out of 20', '20 out of 37', '9 out of 35',
'27 out of 27', '15 out of 60', '19 out of 33', '18 out of 20',
'13 out of 40', '9 out of 11', '8 out of 22', '6 out of 13',
'20 out of 31', '27 out of 45', '19 out of 20', '32 out of 46',
'19 out of 85', '3 out of 23', '34 out of 46', '4 out of 27',
'19 out of 27', '35 out of 60', '21 out of 33', '25 out of 52',
'2 out of 24', '24 out of 24', '18 out of 33', '1 out of 10',
'45 out of 60', '60 out of 71', '36 out of 81', '24 out of 60',
'16 out of 38', '8 out of 45', 'Ground out of 16', '8 out of 32',
'10 out of 10', '23 out of 40', '7 out of 18', '8 out of 19',
'6 out of 17', '16 out of 34', 'Ground out of 12', '6 out of 9',
'Ground out of 18', '20 out of 25', '3 out of 22', '9 out of 32',
'26 out of 32', '17 out of 18', '24 out of 25', '19 out of 26',
'17 out of 19', '1 out of 13', '14 out of 30', '8 out of 9',
'3 out of 17', 'Lower Basement out of 3', '12 out of 23',
'Ground out of 9', '1 out of 24', '1 out of 12', '3', 'Ground',
'15 out of 29', '20 out of 20', '14 out of 29',
'Lower Basement out of 1', '13 out of 17', '1 out of 14',
'Upper Basement out of 2', '2 out of 14', '24 out of 31',
'2 out of 32', '2 out of 16', '9 out of 13', '1 out of 11',
'6 out of 29', '9 out of 9', '28 out of 31', '1 out of 15',
'Ground out of 14', '2 out of 11', '19 out of 31', '1 out of 16',
'25 out of 32', '11 out of 16', '11 out of 17',
'Upper Basement out of 3', '4 out of 24', '1 out of 19',
'7 out of 30', '16 out of 19', 'Upper Basement out of 5',
'Ground out of 13', '2 out of 25', '23 out of 30', '4 out of 30',
'13 out of 25', '23 out of 35', 'Ground out of 10', '5 out of 34',
'20 out of 35', '1', '4 out of 31', '4 out of 26', '24 out of 33',
'4 out of 17', '1 out of 35', '11 out of 35', 'Ground out of 15',
'Ground out of 27', '15 out of 30', '12 out of 30', '23 out of 34'],
dtype=object)
= ["Area Locality", "Posted On", "Floor"]
COLS_TO_DROP
=COLS_TO_DROP, inplace=True)
df_categ.drop(columns=COLS_TO_DROP, inplace=True) df.drop(columns
"Area Type"].value_counts() df[
Area Type
Super Area 2446
Carpet Area 2298
Built Area 2
Name: count, dtype: int64
import plotly.express as px
"Area Type") df.value_counts(
Area Type
Super Area 2446
Carpet Area 2298
Built Area 2
Name: count, dtype: int64
"Area Type")) px.bar(df.value_counts(
[Plotly figure: bar chart of Area Type counts]
* df["Area Type"] + c_1
0 Super Area
1 Super Area
2 Super Area
3 Super Area
4 Carpet Area
...
4741 Carpet Area
4742 Super Area
4743 Carpet Area
4744 Carpet Area
4745 Carpet Area
Name: Area Type, Length: 4746, dtype: object
Encoding categorical features
https://www.youtube.com/watch?v=589nCGeWG1w
One Hot Encoding
"City") df.value_counts(
pip install scikit-learn
or
conda install scikit-learn
Note: the package is named scikit-learn, NOT sklearn.
from sklearn.preprocessing import OneHotEncoder
# https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.OneHotEncoder.html
= df[["City"]]
city city
City | |
---|---|
0 | Kolkata |
1 | Kolkata |
2 | Kolkata |
3 | Kolkata |
4 | Kolkata |
... | ... |
4741 | Hyderabad |
4742 | Hyderabad |
4743 | Hyderabad |
4744 | Hyderabad |
4745 | Hyderabad |
4746 rows × 1 columns
ohe = OneHotEncoder(sparse_output=False)
ohe.fit(city)
OneHotEncoder(sparse_output=False)
city_transformed = ohe.transform(city)  # or fit_transform in one step
city_transformed
array([[0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 1., 0.],
...,
[0., 0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0., 0.]])
pd.DataFrame(city_transformed)
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... |
4741 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4742 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4743 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4744 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4745 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4746 rows × 6 columns
ohe.get_feature_names_out()
array(['City_Bangalore', 'City_Chennai', 'City_Delhi', 'City_Hyderabad',
'City_Kolkata', 'City_Mumbai'], dtype=object)
encoded_city = pd.DataFrame(city_transformed, columns=ohe.get_feature_names_out())
encoded_city
City_Bangalore | City_Chennai | City_Delhi | City_Hyderabad | City_Kolkata | City_Mumbai | |
---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... |
4741 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4742 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4743 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4744 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4745 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4746 rows × 6 columns
pd.concat([df, encoded_city], axis=1)
BHK | Rent | Size | Area Type | City | Furnishing Status | Tenant Preferred | Bathroom | Point of Contact | City_Bangalore | City_Chennai | City_Delhi | City_Hyderabad | City_Kolkata | City_Mumbai | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 10000 | 1100 | Super Area | Kolkata | Unfurnished | Bachelors/Family | 2 | Contact Owner | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1 | 2 | 20000 | 800 | Super Area | Kolkata | Semi-Furnished | Bachelors/Family | 1 | Contact Owner | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 2 | 17000 | 1000 | Super Area | Kolkata | Semi-Furnished | Bachelors/Family | 1 | Contact Owner | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
3 | 2 | 10000 | 800 | Super Area | Kolkata | Unfurnished | Bachelors/Family | 1 | Contact Owner | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
4 | 2 | 7500 | 850 | Carpet Area | Kolkata | Unfurnished | Bachelors | 1 | Contact Owner | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4741 | 2 | 15000 | 1000 | Carpet Area | Hyderabad | Semi-Furnished | Bachelors/Family | 2 | Contact Owner | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4742 | 3 | 29000 | 2000 | Super Area | Hyderabad | Semi-Furnished | Bachelors/Family | 3 | Contact Owner | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4743 | 3 | 35000 | 1750 | Carpet Area | Hyderabad | Semi-Furnished | Bachelors/Family | 3 | Contact Agent | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4744 | 3 | 45000 | 1500 | Carpet Area | Hyderabad | Semi-Furnished | Family | 2 | Contact Agent | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4745 | 2 | 15000 | 1000 | Carpet Area | Hyderabad | Unfurnished | Bachelors | 2 | Contact Owner | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4746 rows × 15 columns
Problem with OHE
encoded_city
City_Bangalore | City_Chennai | City_Delhi | City_Hyderabad | City_Kolkata | City_Mumbai | |
---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... |
4741 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4742 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4743 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4744 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4745 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4746 rows × 6 columns
encoded_city.iloc[:, :-1].sum(axis=1)
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
...
4741 1.0
4742 1.0
4743 1.0
4744 1.0
4745 1.0
Length: 4746, dtype: float64
1 - encoded_city.iloc[:, :-1].sum(axis=1)
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
...
4741 0.0
4742 0.0
4743 0.0
4744 0.0
4745 0.0
Length: 4746, dtype: float64
sum_of_rest = encoded_city.iloc[:, :-1].sum(axis=1)
res = 1 - sum_of_rest
encoded_city.iloc[:, -1] == res
0 True
1 True
2 True
3 True
4 True
...
4741 True
4742 True
4743 True
4744 True
4745 True
Length: 4746, dtype: bool
all(encoded_city.iloc[:, -1] == res)
True
import numpy as np
X_X_t = encoded_city @ encoded_city.T
np.linalg.inv(X_X_t)
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
----> 1 np.linalg.inv(X_X_t)

LinAlgError: Singular matrix
np.linalg.det(X_X_t)
0.0
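Why singular? encoded_city has only 6 columns, so the 4746 x 4746 matrix encoded_city @ encoded_city.T has rank at most 6 and can never be invertible. For regression the relevant version of the problem is the dummy-variable trap: the six one-hot columns sum to 1 in every row, so together with an intercept column the design matrix is rank-deficient. A minimal check:

# the 6 dummies alone are linearly independent, but adding an intercept
# (a column of ones, which equals the sum of the 6 dummies) makes one redundant
ones = np.ones((len(encoded_city), 1))
design = np.hstack([ones, encoded_city.to_numpy()])
print(design.shape[1], np.linalg.matrix_rank(design))  # 7 columns, rank 6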
Removing duplicates (det)
df.duplicated()
0 False
1 False
2 False
3 False
4 False
...
4741 False
4742 False
4743 False
4744 False
4745 False
Length: 4746, dtype: bool
np.any(df.duplicated())
True
df[~df.duplicated]  # oops: df.duplicated is a method, the () is missing
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
----> 1 df[~df.duplicated]

TypeError: bad operand type for unary ~: 'method'
df = df[~df.duplicated()]
Solution
ohe = OneHotEncoder(sparse_output=False, drop="first")

encoded_city_fixed = ohe.fit_transform(city)

pd.DataFrame(encoded_city_fixed, columns=ohe.get_feature_names_out())
City_Chennai | City_Delhi | City_Hyderabad | City_Kolkata | City_Mumbai | |
---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
... | ... | ... | ... | ... | ... |
4741 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4742 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4743 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4744 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4745 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4746 rows × 5 columns
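With drop="first" the redundancy is gone: an intercept column plus the five remaining dummies are linearly independent. A quick check mirroring the one above:

ones = np.ones((len(encoded_city_fixed), 1))
design = np.hstack([ones, encoded_city_fixed])
print(design.shape[1], np.linalg.matrix_rank(design))  # 6 columns, rank 6: full column rank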
OHE with Pandas
pd.get_dummies(city)
City_Bangalore | City_Chennai | City_Delhi | City_Hyderabad | City_Kolkata | City_Mumbai | |
---|---|---|---|---|---|---|
0 | False | False | False | False | True | False |
1 | False | False | False | False | True | False |
2 | False | False | False | False | True | False |
3 | False | False | False | False | True | False |
4 | False | False | False | False | True | False |
... | ... | ... | ... | ... | ... | ... |
4741 | False | False | False | True | False | False |
4742 | False | False | False | True | False | False |
4743 | False | False | False | True | False | False |
4744 | False | False | False | True | False | False |
4745 | False | False | False | True | False | False |
4746 rows × 6 columns
pd.get_dummies(city, drop_first=True)
City_Chennai | City_Delhi | City_Hyderabad | City_Kolkata | City_Mumbai | |
---|---|---|---|---|---|
0 | False | False | False | True | False |
1 | False | False | False | True | False |
2 | False | False | False | True | False |
3 | False | False | False | True | False |
4 | False | False | False | True | False |
... | ... | ... | ... | ... | ... |
4741 | False | False | True | False | False |
4742 | False | False | True | False | False |
4743 | False | False | True | False | False |
4744 | False | False | True | False | False |
4745 | False | False | True | False | False |
4746 rows × 5 columns
LabelEncoding
= df["Furnishing Status"] furnish
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

furnish_le = le.fit_transform(furnish)
furnish_le
array([2, 1, 1, ..., 1, 1, 2])
"Furnish_le"] = furnish_le
df[
df
Posted On | BHK | Rent | Size | Floor | Area Type | Area Locality | City | Furnishing Status | Tenant Preferred | Bathroom | Point of Contact | Furnish_le | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-05-18 | 2 | 10000 | 1100 | Ground out of 2 | Super Area | Bandel | Kolkata | Unfurnished | Bachelors/Family | 2 | Contact Owner | 2 |
1 | 2022-05-13 | 2 | 20000 | 800 | 1 out of 3 | Super Area | Phool Bagan, Kankurgachi | Kolkata | Semi-Furnished | Bachelors/Family | 1 | Contact Owner | 1 |
2 | 2022-05-16 | 2 | 17000 | 1000 | 1 out of 3 | Super Area | Salt Lake City Sector 2 | Kolkata | Semi-Furnished | Bachelors/Family | 1 | Contact Owner | 1 |
3 | 2022-07-04 | 2 | 10000 | 800 | 1 out of 2 | Super Area | Dumdum Park | Kolkata | Unfurnished | Bachelors/Family | 1 | Contact Owner | 2 |
4 | 2022-05-09 | 2 | 7500 | 850 | 1 out of 2 | Carpet Area | South Dum Dum | Kolkata | Unfurnished | Bachelors | 1 | Contact Owner | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4741 | 2022-05-18 | 2 | 15000 | 1000 | 3 out of 5 | Carpet Area | Bandam Kommu | Hyderabad | Semi-Furnished | Bachelors/Family | 2 | Contact Owner | 1 |
4742 | 2022-05-15 | 3 | 29000 | 2000 | 1 out of 4 | Super Area | Manikonda, Hyderabad | Hyderabad | Semi-Furnished | Bachelors/Family | 3 | Contact Owner | 1 |
4743 | 2022-07-10 | 3 | 35000 | 1750 | 3 out of 5 | Carpet Area | Himayath Nagar, NH 7 | Hyderabad | Semi-Furnished | Bachelors/Family | 3 | Contact Agent | 1 |
4744 | 2022-07-06 | 3 | 45000 | 1500 | 23 out of 34 | Carpet Area | Gachibowli | Hyderabad | Semi-Furnished | Family | 2 | Contact Agent | 1 |
4745 | 2022-05-04 | 2 | 15000 | 1000 | 4 out of 5 | Carpet Area | Suchitra Circle | Hyderabad | Unfurnished | Bachelors | 2 | Contact Owner | 2 |
4746 rows × 13 columns
"Furnishing Status") df.drop_duplicates(
Posted On | BHK | Rent | Size | Floor | Area Type | Area Locality | City | Furnishing Status | Tenant Preferred | Bathroom | Point of Contact | Furnish_le | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-05-18 | 2 | 10000 | 1100 | Ground out of 2 | Super Area | Bandel | Kolkata | Unfurnished | Bachelors/Family | 2 | Contact Owner | 2 |
1 | 2022-05-13 | 2 | 20000 | 800 | 1 out of 3 | Super Area | Phool Bagan, Kankurgachi | Kolkata | Semi-Furnished | Bachelors/Family | 1 | Contact Owner | 1 |
12 | 2022-05-14 | 1 | 6500 | 250 | 1 out of 4 | Carpet Area | Tarulia, Keshtopur | Kolkata | Furnished | Bachelors | 1 | Contact Owner | 0 |
"Area Type") df.value_counts(
Area Type
Super Area 2446
Carpet Area 2298
Built Area 2
Name: count, dtype: int64
= le.fit_transform(df["Area Type"]) area_type_le
"area_type_le"] = area_type_le df[
"Area Type") df.drop_duplicates(
Problem with LE
LabelEncoder assigns arbitrary integers in alphabetical order (Built Area = 0, Carpet Area = 1, Super Area = 2), which invents numeric relations that don't exist, e.g. Super Area - Built Area = 2 * Carpet Area.
Solution - Ordinal Encoding
mappings = {
    "Unfurnished": 0,
    "Semi-Furnished": 0.5,
    "Furnished": 1
}
df["Furnishing OE"] = df["Furnishing Status"].map(mappings)
df
Posted On | BHK | Rent | Size | Floor | Area Type | Area Locality | City | Furnishing Status | Tenant Preferred | Bathroom | Point of Contact | Furnish_le | Furnishing OE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-05-18 | 2 | 10000 | 1100 | Ground out of 2 | Super Area | Bandel | Kolkata | Unfurnished | Bachelors/Family | 2 | Contact Owner | 2 | 0.0 |
1 | 2022-05-13 | 2 | 20000 | 800 | 1 out of 3 | Super Area | Phool Bagan, Kankurgachi | Kolkata | Semi-Furnished | Bachelors/Family | 1 | Contact Owner | 1 | 0.5 |
2 | 2022-05-16 | 2 | 17000 | 1000 | 1 out of 3 | Super Area | Salt Lake City Sector 2 | Kolkata | Semi-Furnished | Bachelors/Family | 1 | Contact Owner | 1 | 0.5 |
3 | 2022-07-04 | 2 | 10000 | 800 | 1 out of 2 | Super Area | Dumdum Park | Kolkata | Unfurnished | Bachelors/Family | 1 | Contact Owner | 2 | 0.0 |
4 | 2022-05-09 | 2 | 7500 | 850 | 1 out of 2 | Carpet Area | South Dum Dum | Kolkata | Unfurnished | Bachelors | 1 | Contact Owner | 2 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4741 | 2022-05-18 | 2 | 15000 | 1000 | 3 out of 5 | Carpet Area | Bandam Kommu | Hyderabad | Semi-Furnished | Bachelors/Family | 2 | Contact Owner | 1 | 0.5 |
4742 | 2022-05-15 | 3 | 29000 | 2000 | 1 out of 4 | Super Area | Manikonda, Hyderabad | Hyderabad | Semi-Furnished | Bachelors/Family | 3 | Contact Owner | 1 | 0.5 |
4743 | 2022-07-10 | 3 | 35000 | 1750 | 3 out of 5 | Carpet Area | Himayath Nagar, NH 7 | Hyderabad | Semi-Furnished | Bachelors/Family | 3 | Contact Agent | 1 | 0.5 |
4744 | 2022-07-06 | 3 | 45000 | 1500 | 23 out of 34 | Carpet Area | Gachibowli | Hyderabad | Semi-Furnished | Family | 2 | Contact Agent | 1 | 0.5 |
4745 | 2022-05-04 | 2 | 15000 | 1000 | 4 out of 5 | Carpet Area | Suchitra Circle | Hyderabad | Unfurnished | Bachelors | 2 | Contact Owner | 2 | 0.0 |
4746 rows × 14 columns
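scikit-learn offers the same idea as OrdinalEncoder; a minimal sketch (passing the category order explicitly is the key part: the encoder assigns equally spaced 0, 1, 2, so the 0/0.5/1 spacing above remains a manual choice; the output column name is illustrative):

from sklearn.preprocessing import OrdinalEncoder

# categories listed from least to most furnished; encoded as 0, 1, 2 in that order
oe = OrdinalEncoder(categories=[["Unfurnished", "Semi-Furnished", "Furnished"]])
df["Furnishing_OE_sk"] = oe.fit_transform(df[["Furnishing Status"]])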
Target Encoding (BE CAREFUL!)
https://www.youtube.com/watch?v=589nCGeWG1w
"Point of Contact"].value_counts() df[
Point of Contact
Contact Owner 3216
Contact Agent 1529
Contact Builder 1
Name: count, dtype: int64
"Point of Contact")["Rent"].mean() df.groupby(
Point of Contact
Contact Agent 73481.158927
Contact Builder 5500.000000
Contact Owner 16704.206468
Name: Rent, dtype: float64
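These group means are exactly what target encoding substitutes for the category labels; a minimal manual version with groupby + map (the new column name is illustrative). The "be careful" above is about leakage: the means must come from the training split only, or the target leaks into the features.

# manual target encoding: replace each category with the mean Rent of that category
rent_means = df.groupby("Point of Contact")["Rent"].mean()
df["poc_te_manual"] = df["Point of Contact"].map(rent_means)  # illustrative column name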
pip install category_encoders
https://contrib.scikit-learn.org/category_encoders/
!pip install category_encoders
!pip uninstall -y scikit-learn
!pip install scikit-learn==1.5.2
!pip install category_encoders==2.5.0
import category_encoders as ce
= "Point of Contact"
col
= ce.TargetEncoder(cols=[col])
target_enc
'Rent'])
target_enc.fit(df[col], df[
'Point_of_Coutact_encoded'] = target_enc.transform(df[col]) df[
df
"Point of Contact")["Rent"].mean() df.groupby(
Some preprocessing
"Area Type"].value_counts() df[
= df[df["Area Type"] != "Built Area"]
df
"Area Type"].value_counts() df[
"Point of Contact"].value_counts() df[
= df[df["Point of Contact"] != "Contact Builder"] df
Putting it all together
=["Point_of_Coutact_encoded"], inplace=True) df.drop(columns
= ["Area Type", "City", "Tenant Preferred", \
COLS_OHE "Point of Contact"]
= pd.get_dummies(df[COLS_OHE], drop_first=True) data_OHE
= pd.concat([df, data_OHE], axis=1) df
=COLS_OHE, inplace=True) df.drop(columns
mappings = {
    "Unfurnished": 0,
    "Semi-Furnished": 0.5,
    "Furnished": 1
}
df["Furnishing Status"] = df["Furnishing Status"].map(mappings)
df
Linear Regression with Scikit-learn
df
= df[["BHK", "Rent", "Size", "Bathroom"]] df
X = df
y = df["Rent"]
X.head(1)
BHK | Rent | Size | Bathroom | |
---|---|---|---|---|
0 | 2 | 10000 | 1100 | 2 |
y.head(1)
0 10000
Name: Rent, dtype: int64
from sklearn.linear_model import LinearRegression
# alternatives: statsmodels, xgboost
X
model = LinearRegression()
model.fit(X, y)
LinearRegression()
print("Intercept:", model.intercept_)
print("Coefficient(s):", model.coef_)
Intercept: 7.275957614183426e-12
Coefficient(s): [-1.47712008e-11 1.00000000e+00 -6.34955518e-16 1.68842647e-12]
print("Intercept:", model.intercept_)
print("Coefficient(s):", model.coef_)
= pd.DataFrame({
coef_df 'Feature': X.columns,
'Coefficient': model.coef_
})
coef_df
Intercept: 7.275957614183426e-12
Coefficient(s): [-1.47712008e-11 1.00000000e+00 -6.34955518e-16 1.68842647e-12]
Feature | Coefficient | |
---|---|---|
0 | BHK | -1.477120e-11 |
1 | Rent | 1.000000e+00 |
2 | Size | -6.349555e-16 |
3 | Bathroom | 1.688426e-12 |
pd.options.display.float_format = '{:.6f}'.format
# pd.options.display.max_columns = 1000
print(coef_df)
Feature Coefficient
0 BHK -0.000000
1 Rent 1.000000
2 Size -0.000000
3 Bathroom 0.000000
Rent = 0 * everything_else + 1 * Rent: the model just learned to copy the target, because Rent was accidentally left in the feature matrix.
Fixing the issue
X
= df.drop(columns=["Rent"])
X = df["Rent"]
y
= LinearRegression()
model
model.fit(X, y)
print("Intercept:", model.intercept_)
print("Coefficient(s):", model.coef_)
coef_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
})
coef_df
Intercept: -38793.04450511425
Coefficient(s): [-1.54691474e+03 2.42409133e+01 2.72435614e+04]
Feature | Coefficient | |
---|---|---|
0 | BHK | -1546.914739 |
1 | Size | 24.240913 |
2 | Bathroom | 27243.561381 |
The fitted model predicts roughly: Rent ≈ -38793 + (-1546) * BHK + 24 * Size + 27244 * Bathroom
Interpreting coefficients
Note - In practice, we would first evaluate our model and only then try to interpret it.
\(y_{initial} = \theta_0 + \theta_1 * x_1 + \theta_2 * x_2\)
Increase \(x_1\) by 1
\(y_{new} = \theta_0 + \theta_1 * (x_1 + 1) + \theta_2 * x_2 = \theta_0 + \theta_1 * x_1 + \theta_2 * x_2 + \theta_1 = y_{initial} + \theta_1\)
\(y_{new} - y_{initial} = \theta_1\)
- c. p., ceteris paribus (keeping other variables constant)
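A quick numeric sanity check of this (a sketch using the model fitted above; x0 is just the first row of X): bump Size by 1 with everything else fixed, and the prediction moves by exactly the Size coefficient.

x0 = X.iloc[[0]].copy()
x1 = x0.copy()
x1["Size"] += 1  # increase Size by 1, holding BHK and Bathroom constant

print(model.predict(x1)[0] - model.predict(x0)[0])  # ~24.24, the Size coefficient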
"Coefficient", ascending=False) coef_df.sort_values(
Feature | Coefficient | |
---|---|---|
2 | Bathroom | 27243.561381 |
1 | Size | 24.240913 |
0 | BHK | -1546.914739 |
Size = 24.240913 - For each additional square foot, the rent increases by about 24.24 currency units on average, holding other factors constant (c. p.).
City_Hyderabad = -15172.606293 - In a model that also includes the city dummies, being in Hyderabad (vs. the baseline city) lowers predicted rent by about 15172.61 on average, c. p.
Note - Ceteris paribus (a Latin phrase, meaning “other things equal”))
Important: we should not sort the coefficients by raw value, but by their absolute value.
"Coefficient", key=abs, ascending=False) coef_df.sort_values(
Why does Size look so unimportant? That does not make sense.
"Size", "Bathroom"]].describe() df[[
Size | Bathroom | |
---|---|---|
count | 4746.000000 | 4746.000000 |
mean | 967.490729 | 1.965866 |
std | 634.202328 | 0.884532 |
min | 10.000000 | 1.000000 |
25% | 550.000000 | 1.000000 |
50% | 850.000000 | 2.000000 |
75% | 1200.000000 | 2.000000 |
max | 8000.000000 | 10.000000 |
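This is a scale effect, not an importance effect: Size has std ≈ 634 (square feet) while Bathroom has std ≈ 0.88, so "+1 unit" means very different things. Multiplying each raw coefficient by its feature's std gives the effect of a one-standard-deviation change; the numbers agree with the standardized-model coefficients fitted further below.

# raw coefficient * feature std = effect of a one-standard-deviation change
print(24.240913 * 634.202328)    # Size: ~15373
print(27243.561381 * 0.884532)   # Bathroom: ~24098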
Feature Scaling
"Size", "Bathroom"]) px.histogram(df, [
[Plotly figure: histograms of Size and Bathroom]
Outlier
Min Max Scaling
x' = (x - min) / (max - min) maps the smallest value to 0 and the largest to 1:

1, 5, 10 -> 0, 0.44, 1

The problem is outliers: a single extreme value dominates the range and squashes everything else.

1, 5, 1000000 -> 0, 0.000004, 1 (the first two values become indistinguishable)

-100, 0.1, 0.5, 1 -> subtract the min: 0, 100.1, 100.5, 101 -> divide by the range: 0, 0.991, 0.995, 1 (the last three values are crammed together near 1)
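The same squashing effect in code, as a tiny sketch (numpy was imported above):

def min_max(a):
    return (a - a.min()) / (a.max() - a.min())

print(min_max(np.array([1.0, 5.0, 10.0])))         # [0.     0.4444 1.    ]
print(min_max(np.array([-100.0, 0.1, 0.5, 1.0])))  # [0.     0.9911 0.9950 1.    ] -> squashed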
"Size") px.histogram(df,
[Plotly figure: histogram of Size]
def min_max_scale(df, col):
return (df[col] - df[col].min()) / (df[col].max() - df[col].min())
"Size_min_max"] = min_max_scale(df, "Size")
df["Bathroom_min_max"] = min_max_scale(df, "Bathroom")
df[
"Size_min_max", "Bathroom_min_max"]) px.histogram(df, [
(pandas emits SettingWithCopyWarning here: df is a filtered slice of the original frame, so assignments should go through .loc, or df should be made an explicit copy with df = df.copy())
[Plotly figure: histograms of min-max scaled Size and Bathroom]
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

scaler.fit(df[['Size', 'Bathroom']])

scaled_data = scaler.transform(df[['Size', 'Bathroom']])

df[['Size_minmax', 'Bathroom_minmax']] = scaled_data

df.iloc[:, -4:]
Size_min_max | Bathroom_min_max | Size_minmax | Bathroom_minmax | |
---|---|---|---|---|
0 | 0.136421 | 0.111111 | 0.136421 | 0.111111 |
1 | 0.098874 | 0.000000 | 0.098874 | 0.000000 |
2 | 0.123905 | 0.000000 | 0.123905 | 0.000000 |
3 | 0.098874 | 0.000000 | 0.098874 | 0.000000 |
4 | 0.105131 | 0.000000 | 0.105131 | 0.000000 |
... | ... | ... | ... | ... |
4741 | 0.123905 | 0.111111 | 0.123905 | 0.111111 |
4742 | 0.249061 | 0.222222 | 0.249061 | 0.222222 |
4743 | 0.217772 | 0.222222 | 0.217772 | 0.222222 |
4744 | 0.186483 | 0.111111 | 0.186483 | 0.111111 |
4745 | 0.123905 | 0.111111 | 0.123905 | 0.111111 |
4746 rows × 4 columns
Standard Scaling
ToDo - add a plot showing the transformation

z = (x - mean) / std rescales the data to mean 0 and standard deviation 1 (for normally distributed data, to the standard normal N(0, 1)).

2, 3, 4 (mean 3) and 999, 1000, 1001 (mean 1000) standardize to exactly the same values: all that survives is each point's distance from the mean, measured in standard deviations.
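A two-line check of that claim (a sketch; numpy's .std() uses the population formula, so the values come out as ±1.22 rather than the ±1 that pandas' default ddof=1 would give):

def z(a):
    return (a - a.mean()) / a.std()

print(z(np.array([2.0, 3.0, 4.0])))          # [-1.2247  0.  1.2247]
print(z(np.array([999.0, 1000.0, 1001.0])))  # [-1.2247  0.  1.2247] -> identical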
def standard_scale(df, col):
"""
Standardize a single column to have mean 0 and std dev 1:
z = (x - mean) / std
"""
return (df[col] - df[col].mean()) / df[col].std()
"Size_std_manual"] = standard_scale(df, "Size")
df["Bathroom_std_manual"] = standard_scale(df, "Bathroom") df[
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(df[['Size', 'Bathroom']])

scaled_data = scaler.transform(df[['Size', 'Bathroom']])

df[['Size_standard', 'Bathroom_standard']] = scaled_data

df.iloc[:, -4:]
Size_std_manual | Bathroom_std_manual | Size_standard | Bathroom_standard | |
---|---|---|---|---|
0 | 0.208938 | 0.038590 | 0.208960 | 0.038594 |
1 | -0.264097 | -1.091952 | -0.264125 | -1.092067 |
2 | 0.051260 | -1.091952 | 0.051265 | -1.092067 |
3 | -0.264097 | -1.091952 | -0.264125 | -1.092067 |
4 | -0.185257 | -1.091952 | -0.185277 | -1.092067 |
... | ... | ... | ... | ... |
4741 | 0.051260 | 0.038590 | 0.051265 | 0.038594 |
4742 | 1.628044 | 1.169132 | 1.628216 | 1.169255 |
4743 | 1.233848 | 1.169132 | 1.233978 | 1.169255 |
4744 | 0.839652 | 0.038590 | 0.839741 | 0.038594 |
4745 | 0.051260 | 0.038590 | 0.051265 | 0.038594 |
4746 rows × 4 columns
"Size_std_manual", "Bathroom_std_manual"]) px.histogram(df, [
[Plotly figure: histograms of standardized Size and Bathroom]
df
BHK | Rent | Size | Bathroom | Size_min_max | Bathroom_min_max | Size_minmax | Bathroom_minmax | Size_std_manual | Bathroom_std_manual | Size_standard | Bathroom_standard | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 10000 | 1100 | 2 | 0.136421 | 0.111111 | 0.136421 | 0.111111 | 0.208938 | 0.038590 | 0.208960 | 0.038594 |
1 | 2 | 20000 | 800 | 1 | 0.098874 | 0.000000 | 0.098874 | 0.000000 | -0.264097 | -1.091952 | -0.264125 | -1.092067 |
2 | 2 | 17000 | 1000 | 1 | 0.123905 | 0.000000 | 0.123905 | 0.000000 | 0.051260 | -1.091952 | 0.051265 | -1.092067 |
3 | 2 | 10000 | 800 | 1 | 0.098874 | 0.000000 | 0.098874 | 0.000000 | -0.264097 | -1.091952 | -0.264125 | -1.092067 |
4 | 2 | 7500 | 850 | 1 | 0.105131 | 0.000000 | 0.105131 | 0.000000 | -0.185257 | -1.091952 | -0.185277 | -1.092067 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4741 | 2 | 15000 | 1000 | 2 | 0.123905 | 0.111111 | 0.123905 | 0.111111 | 0.051260 | 0.038590 | 0.051265 | 0.038594 |
4742 | 3 | 29000 | 2000 | 3 | 0.249061 | 0.222222 | 0.249061 | 0.222222 | 1.628044 | 1.169132 | 1.628216 | 1.169255 |
4743 | 3 | 35000 | 1750 | 3 | 0.217772 | 0.222222 | 0.217772 | 0.222222 | 1.233848 | 1.169132 | 1.233978 | 1.169255 |
4744 | 3 | 45000 | 1500 | 2 | 0.186483 | 0.111111 | 0.186483 | 0.111111 | 0.839652 | 0.038590 | 0.839741 | 0.038594 |
4745 | 2 | 15000 | 1000 | 2 | 0.123905 | 0.111111 | 0.123905 | 0.111111 | 0.051260 | 0.038590 | 0.051265 | 0.038594 |
4746 rows × 12 columns
= df[["BHK", "Size_standard","Bathroom_standard"]] df_scaled
= df_scaled#.drop(columns=["Rent"])
X = df["Rent"]
y
= LinearRegression()
model
model.fit(X, y)
print("Intercept:", model.intercept_)
print("Coefficient(s):", model.coef_)
= pd.DataFrame({
coef_df 'Feature': X.columns,
'Coefficient': model.coef_
})
coef_df
Intercept: 38217.00521962237
Coefficient(s): [-1546.91473937 15372.02391618 24095.25384968]
Feature | Coefficient | |
---|---|---|
0 | BHK | -1546.914739 |
1 | Size_standard | 15372.023916 |
2 | Bathroom_standard | 24095.253850 |
Before scaling, feature ranges can differ by orders of magnitude (1 vs 100000000); after scaling they are comparable (1 vs 1.1), so the coefficients become directly comparable.
Train Test Split
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import plotly.express as px
import plotly.graph_objects as go
np.random.seed(509)
num_samples = 100
theta_0_true = 4
theta_1_true = 3

X = np.random.rand(num_samples, 1)
y = theta_0_true + theta_1_true * X
df = pd.DataFrame({'x': X.flatten(), 'y': y.flatten()})
df
x | y | |
---|---|---|
0 | 0.755873 | 6.267619 |
1 | 0.250878 | 4.752635 |
2 | 0.705838 | 6.117513 |
3 | 0.377670 | 5.133011 |
4 | 0.722901 | 6.168702 |
... | ... | ... |
95 | 0.384609 | 5.153826 |
96 | 0.804502 | 6.413507 |
97 | 0.626121 | 5.878364 |
98 | 0.838703 | 6.516109 |
99 | 0.126067 | 4.378202 |
100 rows × 2 columns
"x", "y") px.scatter(df,
[Plotly figure: scatter of x vs y on the noiseless line]
SETTING THE SEED IS IMPORTANT
import numpy as np
np.random.seed(1)
print(np.random.randint(1,100))
print(np.random.randint(1,100))
38
13
Adding noise
np.random.seed(50004)

X = np.random.rand(200, 1)
y = theta_0_true + theta_1_true * X + np.random.randn(200, 1) / 3
df = pd.DataFrame({'x': X.flatten(), 'y': y.flatten()})
x_vals = np.linspace(0, 1, 100)
y_vals = theta_0_true + theta_1_true * x_vals

fig = px.scatter(df, "x", "y")

fig.add_trace(go.Scatter(x=x_vals, y=y_vals, mode='lines', name='True Line'))
fig.show()
[Plotly figure: noisy scatter with the true line]
First work with a few points
= df.sort_values("x").head(3)
df_small df_small
x | y | |
---|---|---|
193 | 0.000035 | 4.266323 |
5 | 0.007691 | 3.423349 |
167 | 0.009319 | 3.936942 |
# show fig, but limit the x-axis to ~0.012
fig.update_layout(xaxis=dict(range=[df_small.x.min()-0.01, df_small.x.max()+0.01]),
                  yaxis=dict(range=[df_small.y.min()-0.01, df_small.y.max()+0.01]))
fig.show()
[Plotly figure: zoomed view of the three smallest-x points]
= df_small[["x"]]
X = df_small["y"] y
10th degree polynomial
data = pd.DataFrame()

data["intercept"] = np.ones(7)
data["x"] = np.array([1,2,3,4,5,6,7])
data["x2"] = data["x"] ** 2
data["x3"] = data["x"] ** 3
data
intercept | x | x2 | x3 | |
---|---|---|---|---|
0 | 1.0 | 1 | 1 | 1 |
1 | 1.0 | 2 | 4 | 8 |
2 | 1.0 | 3 | 9 | 27 |
3 | 1.0 | 4 | 16 | 64 |
4 | 1.0 | 5 | 25 | 125 |
5 | 1.0 | 6 | 36 | 216 |
6 | 1.0 | 7 | 49 | 343 |
from sklearn.preprocessing import PolynomialFeatures
= df_small[["x"]]
X = df_small["y"]
y
= np.linspace(X.min(), X.max(), 100).reshape(-1,1)
x_vals
# Polynomial regression
= PolynomialFeatures(degree=3)
poly = poly.fit_transform(X)
X_poly = LinearRegression()
model_poly model_poly.fit(X_poly, y)
LinearRegression()
model_poly.coef_
array([-1.15943521e-09, -4.64230870e+02, 4.58301265e+04, 7.81205706e+02])
# plugging x = 8 into the fitted cubic by hand (note the huge, unstable coefficients):
1 * -1.159e-09 + -4.6423087e+02 * 8 + 4.58301265e+04 * 8**2 + 7.81205706e+02 * 8**3
pd.DataFrame(X_poly), y
( 0 1 2 3
0 1.0 0.000035 1.212560e-09 4.222358e-14
1 1.0 0.007691 5.915409e-05 4.549641e-07
2 1.0 0.009319 8.684297e-05 8.092860e-07,
193 4.266323
5 3.423349
167 3.936942
Name: y, dtype: float64)
from sklearn.preprocessing import PolynomialFeatures
= df_small[["x"]]
X = df_small["y"]
y
= np.linspace(X.min(), X.max(), 100).reshape(-1,1)
x_vals
# X = pd.DataFrame(np.array([1,2,3]))
# Polynomial regression
= PolynomialFeatures(degree=10)
poly = poly.fit_transform(X)
X_poly
# print(X)
# print(pd.DataFrame(X_poly))
= LinearRegression()
model_poly
model_poly.fit(X_poly, y)
# Predictions for polynomial regression
= poly.transform(x_vals)
x_vals_poly = model_poly.predict(x_vals_poly)
y_pred_poly
# Linear regression
= LinearRegression()
model_line
model_line.fit(X, y)
# Predictions for linear regression
= model_line.predict(x_vals)
y_pred_line
# Plotting
= px.scatter(df_small, "x", "y", title="Actual Data, True Line, Line Fit, and Poly Fit")
fig =x_vals.flatten(), y=theta_0_true + theta_1_true * x_vals.flatten(), mode='lines', name='True Line'))
fig.add_trace(go.Scatter(x=x_vals.flatten(), y=y_pred_poly.flatten(), mode='lines', name='Poly Fit'))
fig.add_trace(go.Scatter(x=x_vals.flatten(), y=y_pred_line.flatten(), mode='lines', name='Line Fit'))
fig.add_trace(go.Scatter(x fig.show()
UserWarning: X does not have valid feature names, but PolynomialFeatures was fitted with feature names
UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
[Plotly figure: three points with true line, line fit, and degree-10 poly fit]
= df[["x"]]
X_all = df["y"]
y_all
# predict the models on all data
= model_line.predict(X_all)
y_pred_line_all = model_poly.predict(poly.transform(X_all))
y_pred_poly_all
# plot the predictions
= px.scatter(df, "x", "y", title="Actual Data, Line Fit, Poly Fit")
fig =df["x"], y=y_pred_line_all.flatten(), mode='markers', name='Line Fit'))
fig.add_trace(go.Scatter(x=df["x"], y=y_pred_poly_all.flatten(), mode='markers', name='Poly Fit'))
fig.add_trace(go.Scatter(x=x_vals.flatten(), y=theta_0_true + theta_1_true * x_vals.flatten(), mode='lines', name='True Line'))
fig.add_trace(go.Scatter(x
fig.show()
[Plotly figure: line and poly predictions on all 200 points]
Errors
small_line_mse = mean_squared_error(y, model_line.predict(X))
small_poly_mse = mean_squared_error(y, model_poly.predict(poly.transform(X)))

all_line_mse = mean_squared_error(y_all, model_line.predict(X_all))
all_poly_mse = mean_squared_error(y_all, model_poly.predict(poly.transform(X_all)))
print(f"Small Data MSE - Line: {small_line_mse:.5f}")
print(f"Small Data MSE - Poly: {small_poly_mse:.5f}")
print()
print(f"All Data MSE - Line: {all_line_mse:.5f}")
print(f"All Data MSE - Poly: {all_poly_mse:.5f}")
Small Data MSE - Line: 0.06360
Small Data MSE - Poly: 0.00000
All Data MSE - Line: 1197.68639
All Data MSE - Poly: 370576183.33468
Solution - split the data
num_samples = 30

X = np.random.rand(num_samples, 1)
y = theta_0_true + theta_1_true * X + np.random.randn(num_samples, 1) / 3

df = pd.DataFrame({'x': X.flatten(), 'y': y.flatten()})

X = df[["x"]]
y = df["y"]
px.scatter(df, "x", "y")
[Plotly figure: scatter of the 30-point dataset]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=509)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((24, 1), (6, 1), (24,), (6,))
# Linear regression
model_line = LinearRegression()

model_line.fit(X_train, y_train)

y_train_pred_line = model_line.predict(X_train)
y_test_pred_line = model_line.predict(X_test)

# Polynomial regression
model_poly = LinearRegression()
poly = PolynomialFeatures(degree=40)

poly.fit(X_train)
model_poly.fit(poly.transform(X_train), y_train)

y_train_pred_poly = model_poly.predict(poly.transform(X_train))
y_test_pred_poly = model_poly.predict(poly.transform(X_test))
Just a line
print("Train MSE - Line:", mean_squared_error(y_train, y_train_pred_line))
print("Test MSE - Line:", mean_squared_error(y_test, y_test_pred_line))
print()
print("Train MSE - Poly:", mean_squared_error(y_train, y_train_pred_poly))
print("Test MSE - Poly:", mean_squared_error(y_test, y_test_pred_poly))
import plotly.express as px
# Bar plots for Train and Test MSE
mse_data = {
    'Data': ['Train', 'Test'],
    'Line': [mean_squared_error(y_train, y_train_pred_line), mean_squared_error(y_test, y_test_pred_line)],
    'Poly': [mean_squared_error(y_train, y_train_pred_poly), mean_squared_error(y_test, y_test_pred_poly)]
}

mse_df = pd.DataFrame(mse_data)

fig = px.bar(mse_df, x='Data', y=['Line', 'Poly'], barmode='group', title='Train and Test MSE for Line and Poly Models')
fig.update_layout(yaxis_title='Mean Squared Error')
fig.show()
Train MSE - Line: 0.08503000584047282
Test MSE - Line: 0.06293227999615157
Train MSE - Poly: 0.006204590907118421
Test MSE - Poly: 378832.97000561305
[Plotly figure: grouped bar chart of train/test MSE for line and poly models]
# x_vals = np.linspace()
# Sort the dataframe by 'x' values
df_sorted = df.sort_values(by='x')

# Extrapolate the models across the entire range of x values
y_pred_line = model_line.predict(df_sorted[['x']])
y_pred_poly = model_poly.predict(poly.transform(df_sorted[['x']]))

fig = px.scatter(df_sorted, "x", "y", title="Actual Data, Line Fit, Poly Fit")
fig.add_trace(go.Scatter(x=df_sorted["x"], y=y_pred_line.flatten(), mode='lines', name='Line Fit'))
fig.add_trace(go.Scatter(x=df_sorted["x"], y=y_pred_poly.flatten(), mode='lines', name='Poly Fit'))
fig.add_trace(go.Scatter(x=x_vals.flatten(), y=theta_0_true + theta_1_true * x_vals.flatten(), mode='lines', name='True Line'))
fig.show()
[Plotly figure: line and degree-40 poly fits over the sorted x range]
fig.update_layout(xaxis=dict(range=[X.min(), X.max()]),
                  yaxis=dict(range=[y.min(), y.max()]))
[Plotly figure: the same plot, axes limited to the data range]
Underfitting and Overfitting
- "Private Soghomonyan" (Շարքային Սողոմոնյան) - https://www.youtube.com/watch?v=pv_4MVcjEik
- "If it's round, then the middle one is square" (Եթե կլոր ա ուրեմն մեջինը քառակուսի ա) - https://www.youtube.com/watch?v=7arnxebkEUU
Examples
Overfitting: Astrological Predictions in Ancient Civilizations
Historical Backdrop: From Babylonian times onward, countless astrologers meticulously charted the positions of celestial bodies, connecting them with floods, famines, victories in war, and the births of royals.
Why it’s Overfitting: - Ancient astrologers often looked at every tiny “coincidence” between a planetary alignment and historical events, building extremely specific rules. (E.g., “When Mars is in Taurus and the moon is half full, there will be a great harvest if the newborn prince is left-handed!”) - These detailed “models” fit prior observations too well, often capturing noise and coincidences rather than robust truths.
Moral of the Story: Squeezing meaning out of every alignment of the stars is like overfitting on random noise in a dataset!
Underfitting: “Bleeding” as a Medieval Medical Treatment
Historical Backdrop: For centuries, a common medical practice in Europe was bloodletting—draining blood to “rebalance the humors” and cure ailments from headaches to fevers.
Why it’s Underfitting: - The medical “model” at the time was extremely simplistic: “Something’s wrong? Let’s remove blood.” - They applied the same one-size-fits-all approach to all sorts of diseases, ignoring the huge variability between different medical conditions (and patients). - Because the underlying theory was so rudimentary (the four humors concept), the “model” rarely fit the real complexity of physiology.
Moral of the Story: When your theory is too general and ignores most of the nuanced details, you’re underfitting the complexity of reality (and might end up making people worse).
Overfitting: Your Uncle’s Hyper-Specific Sports Superstitions
Everyday Fun Example: Maybe you have an uncle who insists on wearing the exact same (unwashed) socks during every big game, needs to place the remote exactly 5 inches from the TV, and can only eat “lucky peanuts” if the score is tied.
Why it’s Overfitting: - He’s discovered a string of coincidences: whenever he did those specific rituals, his team happened to win. - He’s latched onto every tiny detail—like someone building an overly complex machine-learning model that memorizes all the noise in the training data. - The moment “new data” arrives—i.e., the team loses despite the lucky peanuts—his model is proven to have no real predictive power.
Moral of the Story: If your “model” requires that many hyper-specific conditions to “succeed,” it’s probably not robust!
Demo
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# ------------------------------------------------------------------------------
# 1. Generate Synthetic Data (True degree = 3)
# ------------------------------------------------------------------------------
np.random.seed(509)
poly_degree = 20
# True polynomial function (3rd degree)
def true_function(x):
# Example: y = 1 + 2x + 3x^2 + 4x^3, plus random noise
return 1 + 2*x + 3*x**2 + 4*x**3 + np.random.normal(0, 20, size=x.shape)
def true_function_without_noise(x):
return 1 + 2*x + 3*x**2 + 4*x**3
# Generate data
N = 100
X = np.linspace(-3, 3, N).reshape(-1, 1)
y = true_function(X.ravel())

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=42)
# ------------------------------------------------------------------------------
# 2. Fit Polynomial Models of Degree 1 to 10 & Record MSE
# ------------------------------------------------------------------------------
degrees = range(1, poly_degree)
train_mses = []
test_mses = []
polynomial_predictions = {}  # Store predictions for plotting

# A dense grid for plotting model predictions
X_plot = np.linspace(X.min(), X.max(), 200).reshape(-1, 1)

for d in degrees:
    # Create polynomial features
    poly = PolynomialFeatures(degree=d)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)
    X_plot_poly = poly.transform(X_plot)
    # Fit linear regression on polynomial features
    model = LinearRegression()
    model.fit(X_train_poly, y_train)

    # Predict on train and test
    y_train_pred = model.predict(X_train_poly)
    y_test_pred = model.predict(X_test_poly)

    # Calculate MSE
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)

    train_mses.append(train_mse)
    test_mses.append(test_mse)

    # Store predictions for the plotting slider
    y_plot_pred = model.predict(X_plot_poly)
    polynomial_predictions[d] = y_plot_pred
# ------------------------------------------------------------------------------
# 3. Plot MSE vs. Polynomial Degree (Line Chart)
# ------------------------------------------------------------------------------
df_mse = {
    'Degree': list(degrees),
    'Train MSE': train_mses,
    'Test MSE': test_mses
}

fig_mse = px.line(
    df_mse,
    x='Degree',
    y=['Train MSE', 'Test MSE'],
    markers=True,
    title="Train & Test MSE vs. Polynomial Degree"
)
fig_mse.update_layout(
    xaxis=dict(dtick=1),
    yaxis_title="MSE"
)
# ------------------------------------------------------------------------------
# 4. Interactive Plot: Data + Fitted Polynomials (Slider)
# ------------------------------------------------------------------------------
# We'll create a figure with:
#   - Scatter of the training data
#   - Scatter of the test data (optional, or we can mark them differently)
#   - A line that updates for each polynomial degree using frames.

# Base scatter (training data)
scatter_train = go.Scatter(
    x=X_train.ravel(),
    y=y_train,
    mode='markers',
    name='Train Data',
    marker=dict(color='blue', size=6)
)

# Optionally, scatter for test data
scatter_test = go.Scatter(
    x=X_test.ravel(),
    y=y_test,
    mode='markers',
    name='Test Data',
    marker=dict(color='red', size=6)
)

# We'll build frames for each polynomial degree
frames = []
for d in degrees:
    # Create a line trace for the polynomial prediction at degree d
    line_pred = go.Scatter(
        x=X_plot.ravel(),
        y=polynomial_predictions[d],
        mode='lines',
        line=dict(width=3),
        name=f"Degree {d} fit"
    )
    frames.append(
        go.Frame(
            data=[scatter_train, scatter_test, line_pred],
            name=str(d)
        )
    )

# Initial line (degree=1 by default)
init_line = go.Scatter(
    x=X_plot.ravel(),
    y=polynomial_predictions[1],
    mode='lines',
    line=dict(width=3),
    name="Degree 1 fit"
)

# Build the figure with the first frame's data
fig_poly = go.Figure(
    data=[scatter_train, scatter_test, init_line],
    layout=go.Layout(
        title="Polynomial Fits (Degree Slider)",
        xaxis=dict(title="X"),
        yaxis=dict(title="y"),
        updatemenus=[  # Slider button settings
            dict(
                type="buttons",
                showactive=False,
                x=1.15,
                y=1.15,
                xanchor="right",
                yanchor="top",
                buttons=[
                    dict(label="Play",
                         method="animate",
                         args=[None,
                               dict(frame=dict(duration=500, redraw=True),
                                    fromcurrent=True,
                                    transition=dict(duration=300))])
                ]
            )
        ],
        # We'll define sliders next
        sliders=[{
            'currentvalue': {'prefix': 'Degree: ', 'xanchor': 'right'},
            'steps': [
                {'label': str(d),
                 'method': 'animate',
                 'args': [[str(d)],
                          dict(mode='immediate',
                               frame=dict(duration=300, redraw=True),
                               transition=dict(duration=300))]
                 } for d in degrees
            ]
        }]
    ),
    frames=frames
)

# add the true function to fig_poly
true_y = true_function_without_noise(X_plot.ravel())
fig_poly.add_trace(go.Scatter(x=X_plot.ravel(), y=true_y, mode='lines', name='True Function'))
# ------------------------------------------------------------------------------
# 5. Show the plots
# ------------------------------------------------------------------------------
fig_mse.show()
fig_poly.show()
(Two interactive Plotly figures are produced here; they are not rendered in this export.)
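What the MSE chart is expected to show: train MSE falls monotonically as the degree grows, while test MSE bottoms out near the true degree (3) and then climbs as higher-degree fits start chasing the noise.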
Bias-Variance Decomposition
https://scott.fortmann-roe.com/docs/BiasVariance.html
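The linked article derives the decomposition \( \mathbb{E}[(y - \hat{f}(x))^2] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2 \). As a rough illustration (not part of the lesson), the bias and variance terms can be estimated empirically by refitting the same model on many noisy draws of the demo's cubic:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(509)
x_grid = np.linspace(-3, 3, 50).reshape(-1, 1)
f_true = (1 + 2*x_grid + 3*x_grid**2 + 4*x_grid**3).ravel()  # the demo's cubic

def fit_once(degree):
    """Fit on one fresh noisy training set, return predictions on x_grid."""
    X = rng.uniform(-3, 3, 60).reshape(-1, 1)
    y = (1 + 2*X + 3*X**2 + 4*X**3).ravel() + rng.normal(0, 20, 60)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    return model.predict(x_grid)

for degree in (1, 3, 15):
    preds = np.array([fit_once(degree) for _ in range(200)])  # 200 resamples
    bias_sq = np.mean((preds.mean(axis=0) - f_true) ** 2)     # squared bias, averaged over x
    variance = np.mean(preds.var(axis=0))                     # variance, averaged over x
    print(f"degree={degree:2d}  bias^2={bias_sq:12.2f}  variance={variance:12.2f}")

Degree 1 should show high bias and low variance (underfitting), degree 15 the reverse (overfitting), with degree 3 balanced between the two.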
Evaluation
Classics
Go to PDF ml\Chapter 1 Regression Main Concepts\PDF\L02 Regression Losses.pdf
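The PDF walks through the regression losses; for quick reference, the standard definitions used later in this notebook:
\[ \text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|, \qquad \text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \text{RMSE} = \sqrt{\text{MSE}}, \qquad \text{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \]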
Pearson and Spearman Correlation
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import pearsonr, spearmanr
# Generate sample data
np.random.seed(509)
x = np.linspace(1, 100, 100)
y_linear = 2 * x + np.random.normal(0, 10, size=len(x))          # Linear relationship
y_monotonic = np.log(x) + np.random.normal(0, 0.2, size=len(x))  # Monotonic but not linear
y_outlier = y_linear.copy()
y_outlier[-1] += 50_900  # Add an outlier

# Calculate correlation values
pearson_linear = pearsonr(x, y_linear)[0]
spearman_linear = spearmanr(x, y_linear)[0]

pearson_monotonic = pearsonr(x, y_monotonic)[0]
spearman_monotonic = spearmanr(x, y_monotonic)[0]

pearson_outlier = pearsonr(x, y_outlier)[0]
spearman_outlier = spearmanr(x, y_outlier)[0]
import plotly.express as px
import plotly.graph_objects as go

# Pearson's linear correlation example
fig1 = px.scatter(x=x, y=y_linear,
                  title=f"Linear Relationship | Pearson - {pearson_linear:.2f} | Spearman - {spearman_linear:.2f}")
fig1.add_trace(go.Scatter(x=x, y=y_linear, mode='markers'))
fig1.update_layout(xaxis_title="X", yaxis_title="Y")
fig1.show()

# Spearman's monotonic relationship example
fig2 = px.scatter(x=x, y=y_monotonic,
                  title=f"Monotonic Relationship | Pearson - {pearson_monotonic:.2f} | Spearman - {spearman_monotonic:.2f}")
fig2.add_trace(go.Scatter(x=x, y=y_monotonic, mode='markers'))
fig2.update_layout(xaxis_title="X", yaxis_title="Y (Rank-based)")
fig2.show()

# Pearson with outlier example
fig3 = px.scatter(x=x, y=y_outlier,
                  title=f"Outlier | Pearson - {pearson_outlier:.2f} | Spearman - {spearman_outlier:.2f}")
fig3.add_trace(go.Scatter(x=x, y=y_outlier, mode='markers'))
fig3.update_layout(xaxis_title="X", yaxis_title="Y")
fig3.show()

# Spearman on sine wave
x = np.linspace(0, 1 * np.pi, 100)
y = np.sin(x)

pearson_sine = pearsonr(x, y)[0]
spearman_sine = spearmanr(x, y)[0]

fig4 = px.scatter(x=x, y=y,
                  title=f"Sine Wave | Pearson - {pearson_sine:.2f} | Spearman - {spearman_sine:.2f}")
fig4.add_trace(go.Scatter(x=x, y=y, mode='markers'))
fig4.update_layout(xaxis_title="X", yaxis_title="Y")
fig4.show()
(Four interactive Plotly figures are produced here; they are not rendered in this export.)
Pearson’s Correlation (r)
- Measures the linear relationship between two variables.
- Assumes data is normally distributed.
- Sensitive to outliers.
- Values range from -1 to +1:
- +1: Perfect positive linear correlation.
- -1: Perfect negative linear correlation.
- 0: No linear correlation.
Formula:
\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} \]
Spearman’s Correlation (ρ)
- Measures the monotonic relationship between two variables.
- Does not assume normality (non-parametric).
- Robust against outliers.
- Values range from -1 to +1:
- +1: Perfect positive monotonic relationship.
- -1: Perfect negative monotonic relationship.
- 0: No monotonic relationship.
Formula:
\[ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \]
- \(d_i\): Difference between ranks of \(x_i\) and \(y_i\).
- \(n\): Number of pairs.
When to Use:

Criteria | Pearson | Spearman |
---|---|---|
Linear Relationship | Yes | No |
Monotonic Relationship | No | Yes |
Outlier Sensitivity | High | Low |
Normality Assumption | Yes | No |
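A useful identity behind the table: Spearman’s ρ is simply Pearson’s r computed on the ranks of the data. A minimal self-contained check (the data here is illustrative, not from the lesson):

import numpy as np
from scipy.stats import pearsonr, spearmanr, rankdata

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = np.exp(a) + rng.normal(scale=0.1, size=200)  # monotonic but nonlinear in a

rho_direct = spearmanr(a, b)[0]
rho_via_ranks = pearsonr(rankdata(a), rankdata(b))[0]
print(f"{rho_direct:.6f} == {rho_via_ranks:.6f}")  # identical up to floating-point error

This also explains the table's rows: working on ranks is what makes Spearman robust to outliers and free of the normality assumption.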
Visual Evaluation + Code
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train_poly, y_train)

y_train_pred = model.predict(X_train_poly)
y_test_pred = model.predict(X_test_poly)
# https://github.com/HaykTarkhanyan/coder_moder/blob/main/ml/regression_evaluation_report.py
def evaluate_regression(actual, predictions,
                        model_name=None, filename=None, notes=None,
                        return_metrics=False, show_plots=False,
                        show_metrics=True, plots=False, round_digits=3):
    """
    Function to evaluate a regression model.

    .. warning::
        Assumes that ``scipy``, ``sklearn``, and ``matplotlib`` are installed
        in your environment.

    This function:
    - Prints R2, MAE, MSE, RMSE metrics.
    - Prints Kendall's tau, Pearson's r, Spearman's rho correlation metrics.
    - Plots actual vs. predicted values.
    - Plots residuals vs. predicted values.
    - Plots distribution of residuals.
    - Plots predicted vs. actual distribution.
    - Saves results to file (if specified).
    - Returns metrics as a dictionary (if specified).

    Args:
        actual (array-like): Ground-truth target values.
        predictions (array-like): Model predictions.
        model_name (str, optional): Name of the model (for display/record-keeping).
        filename (str, optional): Path to an HTML file to save the results.
        notes (str, optional): Additional notes to include in the saved file (if `filename` is provided).
        return_metrics (bool, optional): If True, returns a dictionary of metrics. Defaults to False.
        show_plots (bool, optional): If True, calls `plt.show()` for each figure. Defaults to False.
        show_metrics (bool, optional): If True, prints the metrics and correlations to stdout. Defaults to True.
        plots (bool, optional): If True, generates plots. Defaults to False.
        round_digits (int, optional): Number of digits to round the metrics. Defaults to 3.

    Returns:
        dict or None:
            A dictionary of computed metrics if `return_metrics=True`, otherwise None.
    """
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
    from scipy.stats import kendalltau, pearsonr, spearmanr
    from datetime import datetime
    from io import BytesIO
    import base64

    # Ensure inputs are NumPy arrays
    actual = np.array(actual)
    predictions = np.array(predictions)

    def save_figure_to_file(fig):
        """
        Helper function:
        Convert a Matplotlib figure to a base64-encoded PNG for embedding in HTML.
        """
        tmpfile = BytesIO()
        fig.savefig(tmpfile, format='png')
        encoded = base64.b64encode(tmpfile.getvalue()).decode('utf-8')
        return encoded

    # 1. Calculate regression metrics
    r2 = round(r2_score(actual, predictions), round_digits)
    mae = round(mean_absolute_error(actual, predictions), round_digits)
    mape = round(mean_absolute_percentage_error(actual, predictions), round_digits)
    mse = round(mean_squared_error(actual, predictions), round_digits)
    rmse = round(np.sqrt(mean_squared_error(actual, predictions)), round_digits)

    # 2. Calculate correlation metrics
    pearson = round(pearsonr(actual, predictions)[0], round_digits)
    spearman = round(spearmanr(actual, predictions)[0], round_digits)
    kendall = round(kendalltau(actual, predictions)[0], round_digits)

    # 3. Print metrics if needed
    if show_metrics:
        print(f"Model: {model_name or 'N/A'}")
        print(f"R2: {r2}")
        print(f"MAE: {mae}")
        print(f"MAPE: {mape}")
        print(f"MSE: {mse}")
        print(f"RMSE: {rmse}")
        print(f"Pearson Correlation: {pearson}")
        print(f"Spearman Rho: {spearman}")
        print(f"Kendall Tau: {kendall}")

    # 4. Generate plots if requested
    if plots:
        residuals = actual - predictions

        # (a) Predicted vs. Actual
        fig1 = plt.figure()
        plt.scatter(actual, predictions, edgecolor='k', alpha=0.7)
        plt.xlabel("Actual")
        plt.ylabel("Predicted")
        plt.title("Predicted vs. Actual")
        # add a diagonal line
        plt.plot([actual.min(), actual.max()], [actual.min(), actual.max()], 'k--', lw=2)
        if show_plots:
            plt.show()
        prediction_vs_actual = save_figure_to_file(fig1)
        plt.close(fig1)

        # (b) Residuals vs. Predicted
        fig2 = plt.figure()
        plt.scatter(predictions, residuals, edgecolor='k', alpha=0.7)
        plt.axhline(y=0, color='r', linestyle='--')
        plt.xlabel("Predicted")
        plt.ylabel("Residual")
        plt.title("Residuals vs. Predicted")
        if show_plots:
            plt.show()
        residuals_vs_predicted = save_figure_to_file(fig2)
        plt.close(fig2)

        # (c) Distribution of Residuals
        fig3 = plt.figure()
        plt.hist(residuals, bins=30, edgecolor='k', alpha=0.7)
        plt.xlabel("Residual")
        plt.ylabel("Count")
        plt.title("Distribution of Residuals")
        if show_plots:
            plt.show()
        residuals_distribution = save_figure_to_file(fig3)
        plt.close(fig3)

        # (d) Distribution of Predicted vs. Actual
        fig4 = plt.figure()
        plt.hist(actual, bins=30, alpha=0.5, label="Actual", edgecolor='k')
        plt.hist(predictions, bins=30, alpha=0.5, label="Predicted", edgecolor='k')
        plt.xlabel("Value")
        plt.ylabel("Count")
        plt.title("Distribution of Predicted vs. Actual")
        plt.legend()
        if show_plots:
            plt.show()
        predicted_vs_actual_distribution = save_figure_to_file(fig4)
        plt.close(fig4)

    # 5. Save results to file (HTML) if requested
    if filename:
        with open(filename, "w") as f:
            f.write("<html><body>\n")
            f.write(f"<h2>Report generated: {datetime.now()}</h2>\n")
            if model_name:
                f.write(f"<h2>Model Name: {model_name}</h2>\n")
            if notes:
                f.write(f"<h3>Notes:</h3>\n<p>{notes}</p>\n")

            f.write("<h3>Metrics</h3>\n")
            f.write(f"<b>R2:</b> {r2} <br>\n")
            f.write(f"<b>MAE:</b> {mae} <br>\n")
            f.write(f"<b>MAPE:</b> {mape} <br>\n")
            f.write(f"<b>MSE:</b> {mse} <br>\n")
            f.write(f"<b>RMSE:</b> {rmse} <br>\n")

            f.write("<h3>Correlations</h3>\n")
            f.write(f"Pearson: {pearson} <br>\n")
            f.write(f"Spearman: {spearman} <br>\n")
            f.write(f"Kendall Tau: {kendall} <br>\n")

            if plots:
                f.write("<h3>Plots</h3>\n")
                f.write(f'<img src="data:image/png;base64,{prediction_vs_actual}"><br><br>\n')
                f.write(f'<img src="data:image/png;base64,{residuals_vs_predicted}"><br><br>\n')
                f.write(f'<img src="data:image/png;base64,{residuals_distribution}"><br><br>\n')
                f.write(f'<img src="data:image/png;base64,{predicted_vs_actual_distribution}"><br><br>\n')

            f.write("</body></html>\n")

    # 6. Optionally return a dictionary of metrics
    if return_metrics:
        return {
            "model_name": model_name,
            "notes": notes,
            "r2": r2,
            "mae": mae,
            "mape": mape,
            "mse": mse,
            "rmse": rmse,
            "pearson": pearson,
            "spearman": spearman,
            "kendall": kendall
        }

evaluate_regression(y_train, y_train_pred, model_name="Linear Regression", filename="report.html", show_plots=True, plots=True)
Model: Linear Regression
R2: 0.851
MAE: 14.434
MAPE: 1.377
MSE: 352.374
RMSE: 18.772
Pearson Correlation: 0.923
Spearman Rho: 0.846
Kendall Tau: 0.683
="Linear Regression", filename="report.html", show_plots=True, plots=True) evaluate_regression(y_test, y_test_pred, model_name
Model: Linear Regression
R2: -1.17
MAE: 33.684
MAPE: 1.183
MSE: 6155.855
RMSE: 78.459
Pearson Correlation: 0.121
Spearman Rho: 0.623
Kendall Tau: 0.526
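The side-by-side reports make the overfitting diagnosis explicit: the high-degree fit scores R2 = 0.851 on the training data but R2 = -1.17 on the test data, worse than simply predicting the mean, and the Pearson correlation collapses from 0.923 to 0.121.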
Cross Validation
https://scikit-learn.org/1.5/modules/cross_validation.html
5-fold cross-validation: split the data into 5 folds, and let each fold take one turn as the validation set (to the right of the bar) while the remaining folds train the model; a runnable sketch follows the diagram.
1 2 3 4 5
1 2 3 4 | 5
1 2 3 5 | 4
1 2 4 5 | 3
1 3 4 5 | 2
2 3 4 5 | 1
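A minimal sketch with scikit-learn's KFold and cross_val_score, assuming the synthetic X, y from the demo above are still in memory (any regressor or scorer can be swapped in):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

for degree in (1, 3, 10):
    # Pipeline so the polynomial expansion is refit inside each fold
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    print(f"degree={degree:2d}  CV MSE: {-scores.mean():10.1f} +/- {scores.std():.1f}")

Averaging over the 5 held-out folds gives a more stable estimate of test error than a single train-test split.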
Regularization
Go to PDF ml\Chapter 1 Regression Main Concepts\PDF\L03 Regularization.pdf
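For reference, the penalized objectives in scikit-learn's parameterization (Ridge minimizes the plain sum of squares plus an L2 penalty; Lasso scales the squared loss by \(1/(2n)\) and adds an L1 penalty):
\[ \text{Ridge:} \quad \min_w \; \|y - Xw\|_2^2 + \alpha \|w\|_2^2 \qquad\qquad \text{Lasso:} \quad \min_w \; \frac{1}{2n} \|y - Xw\|_2^2 + \alpha \|w\|_1 \]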
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# ------------------------------------------------------------------------------
# 1. Generate Synthetic Data (True degree = 3)
# ------------------------------------------------------------------------------
np.random.seed(509)

def true_function(x):
    # Example: y = 1 + 2x + 3x^2 + 4x^3, plus random noise
    return 1 + 2*x + 3*x**2 + 4*x**3 + np.random.normal(0, 20, size=x.shape)

def true_function_without_noise(x):
    return 1 + 2*x + 3*x**2 + 4*x**3

# Generate data
N = 100
X = np.linspace(-3, 3, N).reshape(-1, 1)
y = true_function(X.ravel())

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# ------------------------------------------------------------------------------
# 2. Fix Polynomial Degree = 20, Vary Regularization Strength (alpha)
# ------------------------------------------------------------------------------
degree = 20
model_name = "Lasso"  # "Ridge" or "Lasso"
alpha_values = [0.001, 0.01, 0.1, 1, 10, 100, 10_000]  # Example alpha (lambda) values

# Prepare polynomial features (degree=20) once
poly = PolynomialFeatures(degree=degree)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# A dense grid for plotting model predictions
X_plot = np.linspace(X.min(), X.max(), 200).reshape(-1, 1)
X_plot_poly = poly.transform(X_plot)

train_mses = []
test_mses = []
predictions = {}

# Fit a Ridge or Lasso model for each alpha
for alpha_val in alpha_values:
    if model_name == "Lasso":
        model = Lasso(alpha=alpha_val)
    else:
        model = Ridge(alpha=alpha_val)
    model.fit(X_train_poly, y_train)

    # Compute predictions and MSE for train / test
    y_train_pred = model.predict(X_train_poly)
    y_test_pred = model.predict(X_test_poly)
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)

    train_mses.append(train_mse)
    test_mses.append(test_mse)

    # Store predictions on the plotting grid
    y_plot_pred = model.predict(X_plot_poly)
    predictions[alpha_val] = y_plot_pred
# ------------------------------------------------------------------------------
# 3. Plot MSE vs. alpha
# ------------------------------------------------------------------------------
df_mse = {
    'Lambda (alpha)': alpha_values,
    'Train MSE': train_mses,
    'Test MSE': test_mses
}
fig_mse = px.line(
    df_mse,
    x='Lambda (alpha)',
    y=['Train MSE', 'Test MSE'],
    markers=True,
    title=f"MSE vs. Lambda (alpha) (Polynomial Degree = {degree})"
)
# Optionally make the x-axis log scale for clarity:
# fig_mse.update_layout(xaxis_type='log')
fig_mse.update_layout(yaxis_title="MSE")
# ------------------------------------------------------------------------------
# 4. Interactive Plot: Data + Fitted Polynomials (Slider over alpha)
# ------------------------------------------------------------------------------
scatter_train = go.Scatter(
    x=X_train.ravel(),
    y=y_train,
    mode='markers',
    name='Train Data',
    marker=dict(color='blue', size=6)
)

scatter_test = go.Scatter(
    x=X_test.ravel(),
    y=y_test,
    mode='markers',
    name='Test Data',
    marker=dict(color='red', size=6)
)

true_y = true_function_without_noise(X_plot.ravel())
scatter_true = go.Scatter(
    x=X_plot.ravel(),
    y=true_y,
    mode='lines',
    name='True Function',
    line=dict(dash='dash', color='black')
)

# Create frames for each alpha
frames = []
for alpha_val in alpha_values:
    line_pred = go.Scatter(
        x=X_plot.ravel(),
        y=predictions[alpha_val],
        mode='lines',
        line=dict(width=3),
        name=f"alpha={alpha_val}"
    )
    frames.append(go.Frame(data=[scatter_train, scatter_test, scatter_true, line_pred],
                           name=str(alpha_val)))

# Initial line (use the first alpha in alpha_values)
init_line = go.Scatter(
    x=X_plot.ravel(),
    y=predictions[alpha_values[0]],
    mode='lines',
    line=dict(width=3),
    name=f"alpha={alpha_values[0]}"
)

fig_poly = go.Figure(
    data=[scatter_train, scatter_test, scatter_true, init_line],
    layout=go.Layout(
        title=f"Polynomial Degree={degree} with {model_name} Regularization (Slider: alpha)",
        xaxis=dict(title="X"),
        yaxis=dict(title="y"),
        updatemenus=[
            dict(
                type="buttons",
                showactive=False,
                x=1.15,
                y=1.15,
                xanchor="right",
                yanchor="top",
                buttons=[
                    dict(label="Play",
                         method="animate",
                         args=[None,
                               dict(frame=dict(duration=500, redraw=True),
                                    fromcurrent=True,
                                    transition=dict(duration=300))])
                ]
            )
        ],
        sliders=[{
            'currentvalue': {'prefix': 'Lambda (alpha): ', 'xanchor': 'right'},
            'steps': [
                {'label': str(a),
                 'method': 'animate',
                 'args': [[str(a)],
                          dict(mode='immediate',
                               frame=dict(duration=300, redraw=True),
                               transition=dict(duration=300))]
                 } for a in alpha_values
            ]
        }]
    ),
    frames=frames
)
# ------------------------------------------------------------------------------
# 5. Show the plots
# ------------------------------------------------------------------------------
fig_mse.show()
fig_poly.show()
c:\Users\hayk_\.conda\envs\thesis\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:697: ConvergenceWarning:
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.455e+04, tolerance: 1.657e+01
(The same ConvergenceWarning is emitted for each alpha value, with duality gaps ranging from 1.455e+04 up to 2.300e+04.)
(Two interactive Plotly figures are produced here; they are not rendered in this export.)
model.coef_
array([ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.69823579e-03,
7.68505403e-05, 2.29958370e-05, 1.72953193e-05, -3.68322120e-05,
2.04327918e-07, -1.37290439e-06, -1.10147475e-07, 7.54009541e-08,
-1.97219436e-08])
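These coefficients come from the last model fitted in the loop (Lasso with alpha=10_000): the L1 penalty has driven 11 of the 21 polynomial weights to exactly zero, which is Lasso's built-in feature selection. A quick way to count the surviving weights (a sketch, assuming `model` is still in scope):

import numpy as np

# Count how many weights the L1 penalty zeroed out exactly
n_zero = int(np.sum(model.coef_ == 0))
print(f"{n_zero} of {model.coef_.size} polynomial coefficients are exactly zero")
# Ridge, by contrast, shrinks weights toward zero but almost never to exactly zero.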